We compare the top \(n\) rows (\(n\) = 500, 1000, 2000) scored by four methods: SD (standard deviation), CV (coefficient of variation), MAD (median absolute deviation) and ATC (ability to correlate to other rows), across five datasets: the Golub leukemia dataset, the HSMM single cell RNASeq dataset, the MCF10CA single cell RNASeq dataset, the Ritz ALL dataset and the TCGA GBM microarray dataset. For each dataset, we visualize the top rows (i.e., the top genes) with heatmaps (where rows are scaled by the \(z\)-score method) and Euler diagrams.
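As a rough illustration of the selection step, the sketch below scores rows by SD, CV and MAD, takes the top \(n\) rows for each score, and applies row-wise \(z\)-score scaling as used for the heatmaps. The helper names `top_rows` and `scale_rows` and the toy matrix are assumptions made for this example, not the code behind the figures. ATC is left out because its exact definition is not given in this section, and `mad()` is used with base R's default scaling constant.

```r
# Hypothetical helper: score rows by SD, CV and MAD, return top-n row indices.
top_rows <- function(mat, n) {
  sd_score  <- apply(mat, 1, sd)
  cv_score  <- sd_score / rowMeans(mat)   # CV = SD divided by the mean
  mad_score <- apply(mat, 1, mad)         # base R default constant (1.4826)
  scores <- list(SD = sd_score, CV = cv_score, MAD = mad_score)
  lapply(scores, function(s) order(s, decreasing = TRUE)[seq_len(n)])
}

# Hypothetical helper: row-wise z-score scaling (each row gets mean 0, SD 1).
scale_rows <- function(mat) t(scale(t(mat)))

set.seed(1)
mat <- matrix(rexp(200 * 10, rate = 0.2), nrow = 200)  # toy positive matrix
top <- top_rows(mat, n = 20)
z   <- scale_rows(mat[top$SD, ])   # scaled rows for the SD-selected genes
```

The three index vectors in `top` can then be compared directly, for example with `intersect()`, which is the information an Euler diagram summarizes.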

Golub leukemia dataset

top n = 500

Figure 1A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset

Figure 2A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset

top n = 1000

Figure 1B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset

Figure 2B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset

top n = 2000

Figure 1C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset

Figure 2C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset

HSMM single cell RNASeq dataset

top n = 500

Figure 3A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset

Figure 4A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset

top n = 1000

Figure 3B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset

Figure 4B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset

top n = 2000

Figure 3C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset

Figure 4C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset

MCF10CA single cell RNASeq dataset

top n = 500

Figure 5A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset

Figure 6A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset

top n = 1000

Figure 5B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset

Figure 6B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset

top n = 2000

Figure 5C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset

Figure 6C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset

Ritz ALL dataset

top n = 500

Figure 7A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset

Figure 8A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset

top n = 1000

Figure 7B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset

Figure 8B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset

top n = 2000

Figure 7C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset

Figure 8C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset

TCGA GBM microarray dataset

top n = 500

Figure 9A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset

Figure 10A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset

top n = 1000

Figure 9B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset

Figure 10B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset

top n = 2000

Figure 9C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset

Figure 10C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset

Uniqueness of the top rows

We calculate the uniqueness of the top 1000 rows (genes) selected by each of the four top-value methods for the GDS cohort (206 datasets) and the recount2 cohort (223 datasets). The uniqueness of a top-value method is the fraction of its top 1000 genes that are not among the top 1000 genes of any of the other three methods. For example, for one dataset, the uniqueness of the ATC method is calculated as:

length(setdiff(S_ATC, union(S_SD, union(S_CV, S_MAD))))/1000

where S_ATC, S_SD, S_CV and S_MAD are the sets of top 1000 genes under each method.
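The same calculation can be applied to all four methods at once. The sketch below is a minimal generalization of the expression above; the function name `uniqueness` and the toy gene sets are assumptions for illustration. It divides by the size of each set, which equals 1000 when every set holds the top 1000 genes.

```r
# Hypothetical helper: uniqueness of each method given a named list of top-gene sets.
uniqueness <- function(top_sets) {
  sapply(names(top_sets), function(m) {
    # Union of the top genes of all other methods
    others <- unique(unlist(top_sets[names(top_sets) != m]))
    length(setdiff(top_sets[[m]], others)) / length(top_sets[[m]])
  })
}

# Toy sets of four genes each (instead of 1000)
top_sets <- list(
  SD  = c("g1", "g2", "g3", "g4"),
  CV  = c("g1", "g5", "g6", "g7"),
  MAD = c("g1", "g2", "g3", "g8"),
  ATC = c("g9", "g10", "g11", "g2")
)
u <- uniqueness(top_sets)
# u: SD = 0.25, CV = 0.75, MAD = 0.25, ATC = 0.75
```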

In the following boxplots (Figure 11), we find that for the GDS datasets, which are microarray datasets, the ATC method generally extracts top genes that are more unique than those of the other three methods (mean uniqueness, SD: 0.131, CV: 0.295, MAD: 0.252, ATC: 0.851). For the recount2 datasets, which are RNASeq datasets, the CV method also extracts a fairly large fraction of unique top genes (mean uniqueness, SD: 0.139, CV: 0.672, MAD: 0.259, ATC: 0.743). Since RNASeq can measure genes with very low expression (according to the recount2 pipeline), we suspect that the high fraction of CV-unique top genes is due to lowly expressed genes in the recount2 datasets (recall that CV is defined as the standard deviation divided by the mean, so a small mean can give a large CV value).
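The effect of a small mean on CV can be seen with two made-up expression vectors that have the same standard deviation but very different means. The values below are invented purely for illustration.

```r
cv <- function(x) sd(x) / mean(x)   # CV = standard deviation divided by the mean

high <- c(98, 100, 102)   # highly expressed gene: SD = 2, mean = 100
low  <- c(0, 2, 4)        # lowly expressed gene:  SD = 2, mean = 2

cv_high <- cv(high)   # 0.02
cv_low  <- cv(low)    # 1
```

Both genes vary by the same absolute amount, yet the lowly expressed one receives a CV fifty times larger, so it would rank far higher under CV while ranking equally under SD.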

Figure 11. Uniqueness of top-value methods in GDS datasets and recount2 datasets

Next we check the base mean of the top genes (the base mean is the mean absolute expression level of a gene). To make the base mean comparable among datasets, the base mean values are replaced by their ranks normalized by the total number of genes in that dataset (calculated as rank(base_mean)/length(base_mean)). For each top-value method in each dataset, the mean normalized rank of the top 1000 genes is used to measure the average base expression level.
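A small worked example of this normalization is given below. The gene names, base mean values and the choice of "top genes" are all hypothetical; only the expression rank(base_mean)/length(base_mean) comes from the text.

```r
# Normalized rank of the base mean, as described in the text
normalized_rank <- function(base_mean) rank(base_mean) / length(base_mean)

# Toy base means for five genes in one dataset
base_mean <- c(geneA = 0.5, geneB = 12, geneC = 3, geneD = 150, geneE = 40)
r <- normalized_rank(base_mean)
# r: geneA = 0.2, geneB = 0.6, geneC = 0.4, geneD = 1.0, geneE = 0.8

# Hypothetical top genes of one method; their mean normalized rank
top_genes <- c("geneA", "geneC")
avg_level <- mean(r[top_genes])   # 0.3, i.e. mostly lowly expressed genes
```

Because the ranks are divided by the number of genes, the result always lies in (0, 1] regardless of the dataset's expression scale.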

The boxplots in Figure 12 show that, in the recount2 datasets, the top 1000 genes by the CV method have much lower expression than the top 1000 genes by the other top-value methods.

Figure 12. Base expression level of top 1000 genes in GDS and recount2 datasets

Figure 13 visualizes the pairwise overlap between methods, averaged over datasets. The mean overlap between methods \(i\) and \(j\) is defined as:

\(\frac{1}{n}\sum_{k=1}^{n} p_{ijk}\)

where \(n\) is the number of datasets and \(p_{ijk}\), the overlap between methods \(i\) and \(j\) in dataset \(k\), is defined as:

\(p_{ijk} = \frac{| S_{ik} \cap S_{jk} |}{1000}\)

where \(S_{ik}\) and \(S_{jk}\) are the sets of top 1000 genes extracted by methods \(i\) and \(j\) in dataset \(k\).
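The two formulas above can be sketched in a few lines. The function names `pairwise_overlap` and `mean_overlap` and the toy data are assumptions for illustration. The denominator defaults to the size of the first set, which is 1000 when the sets hold the top 1000 genes.

```r
# Hypothetical helper: p_ij for one dataset, given a named list of top-gene sets.
pairwise_overlap <- function(top_sets, size = length(top_sets[[1]])) {
  m <- names(top_sets)
  p <- matrix(0, length(m), length(m), dimnames = list(m, m))
  for (i in m) for (j in m) {
    p[i, j] <- length(intersect(top_sets[[i]], top_sets[[j]])) / size
  }
  p
}

# Hypothetical helper: element-wise mean of the overlap matrices over datasets.
mean_overlap <- function(datasets) {
  Reduce(`+`, lapply(datasets, pairwise_overlap)) / length(datasets)
}

# Two toy datasets with two methods and four top genes each
datasets <- list(
  d1 = list(SD = c("a", "b", "c", "d"), ATC = c("a", "b", "x", "y")),
  d2 = list(SD = c("a", "b", "c", "d"), ATC = c("w", "x", "y", "z"))
)
m <- mean_overlap(datasets)
# m["SD", "ATC"] is (0.5 + 0) / 2 = 0.25; the diagonal is 1
```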

Figure 13. The average pair-wise overlap among datasets
